Massively Parallel Data Analysis with PACTs on Nephele

Authors

  • Alexander Alexandrov
  • Dominic Battré
  • Stephan Ewen
  • Max Heimel
  • Fabian Hueske
  • Odej Kao
  • Volker Markl
  • Erik Nijkamp
  • Daniel Warneke

Abstract

Large-scale data analysis applications require processing and analyzing terabytes or even petabytes of data, particularly in the areas of web analysis or scientific data management. This trend has been discussed as “web-scale data management” in a panel at VLDB 2009. Formerly, parallel data processing was the domain of parallel database systems. Today, novel requirements such as scaling out to thousands of machines, improved fault tolerance, and schema-free processing have made a case for new approaches. Among these approaches, the map/reduce programming model [4] and its open-source implementation Hadoop [1] have gained the most attention. Developed for simple logfile analysis, map/reduce systems execute sequential user code in a parallel and fault-tolerant manner, once it has been written to fit the second-order functions map and reduce. However, with the success of map/reduce, many projects have started to push more complex (e.g., SQL-like) operations into the programming model, thereby violating some of its initial design goals (i.e., the separation of parallelization and user code) and paying significant performance penalties. To eliminate these shortcomings, we have developed the Nephele/PACTs system [3, 7], a parallel data processor centered around a programming model of so-called Parallelization Contracts (PACTs) and the scalable parallel execution engine Nephele. Our system pursues the same design goals as map/reduce and is highlighted by three properties:
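
To make the programming model described above concrete, the following is a minimal word-count sketch in plain Java: the user supplies only sequential first-order code (splitting a line into words, counting per word), while the second-order map and reduce steps describe how that code can be parallelized over partitions of the input. The snippet uses ordinary Java streams purely as an illustration; it is not the Hadoop or PACT API.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Conceptual word-count sketch (illustration only, not the Hadoop or PACT API):
// the user writes sequential first-order code; the second-order "map" and
// "reduce" steps describe how it may be parallelized over input partitions.
public class MapReduceSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to do is to be");

        Map<String, Long> counts = lines.stream()
                // map: user code turns each input line into individual words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // group by key and reduce: user code aggregates a count per word
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        counts.forEach((word, count) -> System.out.println(word + " -> " + count));
    }
}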

Similar articles

MapReduce and PACT - Comparing Data Parallel Programming Models

Web-Scale Analytical Processing is a much-investigated topic in current research. Next to parallel databases, new flavors of parallel data processors have recently emerged. One of the most discussed approaches is MapReduce. MapReduce is highlighted by its programming model: all programs expressed as the second-order functions map and reduce can be automatically parallelized. Although MapReduce ...
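
As a hedged sketch of what a parallelization contract beyond map and reduce offers the programmer, the plain-Java snippet below mimics a "match"-style second-order function: for every pair of records from two inputs that share a key, a first-order user function is invoked. The helper name match and the sample data are hypothetical illustrations, not the actual PACT API.

import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

// Conceptual sketch of a "match"-style second-order function (illustration only,
// not the actual PACT API): records from two inputs that share the same key are
// paired, and a first-order user function is applied to every such pair.
public class MatchSketch {
    static <V1, V2, R> List<R> match(Map<String, V1> left, Map<String, V2> right,
                                     BiFunction<V1, V2, R> userCode) {
        return left.entrySet().stream()
                .filter(e -> right.containsKey(e.getKey()))
                .map(e -> userCode.apply(e.getValue(), right.get(e.getKey())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> users = Map.of("u1", "Alice", "u2", "Bob");
        Map<String, Integer> clicks = Map.of("u1", 17, "u3", 4);

        // First-order user code: combine a user name with a click count.
        match(users, clicks, (name, count) -> name + ": " + count)
                .forEach(System.out::println);
    }
}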

Comparative Study on Parallel Data Processing for Resource Allocation in Cloud Computing

Parallel data processing in the cloud has emerged as a killer application for Infrastructure-as-a-Service: providers integrate processing frameworks into their product portfolios, and customers access these services to deploy their programs. Nephele is a framework for scheduling jobs for parallel data processing in cloud computing. Our analysis presents the expected performance of parallel job processing. Nephele is the processing fr...

Efficient Dynamic Resource Allocation Using Nephele in a Cloud Environment

Today, Infrastructure-as-a-Service (IaaS) cloud providers have incorporated parallel data processing frameworks into their clouds for performing many-task computing (MTC) applications. A parallel data processing framework reduces the time and cost of processing the substantial amounts of users’ data. Nephele is a dynamically resource-allocating parallel data processing framework, which is designed for dynami...

Nephele: A cloud platform for simplified, standardized, and reproducible microbiome data analysis.

Motivation: Widespread interest in the study of the microbiome has resulted in data proliferation and the development of powerful computational tools. However, many scientific researchers lack the time, training, or infrastructure to work with large datasets or to install and use command line tools. Results: The National Institute of Allergy and Infectious Diseases (NIAID) has created Nephele, ...

Comparing Data Processing Frameworks for Scalable Clustering

Recent advances in the development of data parallel platforms have provided a significant thrust to the development of scalable data mining algorithms for analyzing massive data sets that are now commonplace. Scalable clustering is a common data mining task that finds consequential applications in astronomy, biology, social network analysis and commercial domains. The variety of platforms avail...

Journal:
  • PVLDB

Volume 3, Issue -

Pages -

Publication date: 2010